4  Model Building and Selection

In this module, we explore strategies for building effective regression models. We cover:

  • Increasing model complexity to capture non-linear relationships and interactions,
  • Assessing model fit while balancing complexity,
  • Techniques for selecting parsimonious models.

4.1 Increasing model complexity to model complex relationships

4.1.1 Why increase model complexity?

We began this course with the simple linear regression model: a single continuous predictor with a straight-line effect. We then expanded to multiple predictors, where each predictor still had a straight-line effect. For reasons that will become clear, these are examples of so-called first-order models:

Key-term: First-order model
A linear regression where each predictor appears only once and to the first power (no squared terms or interactions). For one predictor: \[ E[Y] = \beta_0 + \beta_1 X. \] With multiple predictors, \(X_1, X_2, \dots, X_k\), this extends to \[E[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\]

This first-order structure is the simplest linear model we can make with \(X\). It is often a reasonable starting point, and can go a long way to modeling real-world phenomena (especially in the multiple regression case), but real data are often more complex. Real relationships between predictors and outcomes are often non-linear, or involve interactions between predictors. To capture these more complex relationships, we can increase model complexity by adding ‘higher-order’ and ‘interaction’ terms.

Example 1: Non-linear relationships in data

For example, fitting a straight line to the following data is not ideal, as the relationship between \(X\) and \(Y\) is curved (U-shaped), so there is always part of the relationship the line cannot capture:

4.1.2 Square, Cubic, and Higher-Order Univariate Models

We enrich the basic linear model by adding powers of a predictor, allowing the fitted relationship to bend. In the above case, the “U” shape suggests a quadratic (squared) term may be appropriate.

Key-term: Second-order (Quadratic) univariate model
\[E[Y] = \beta_0 + \beta_1X + \beta_2X^2\]

Example 2: Seeing how \(\beta_2\) bends the curve

Adjust the slider for the squared term to see how changing \(\beta_2\) adds curvature to the fitted relationship. Try to find a good fit - can you guess what model generated this data?

4.1.2.1 Third-order (cubic) univariate model

\[ E[Y] = \beta_0 + \beta_1X + \beta_2X^2 + \beta_3X^3 \]

allows more complex curvature with two changes in direction (an “S”-shape) that a quadratic term cannot capture.

Example 3: Seeing how \(\beta_3\) twists the curve

Adjust the slider for the cubic term to see how changing \(\beta_3\) adds an inflection point to the fitted relationship. Can you tune the parameters to recover the curve that generated this data?

4.1.2.2 N-th order univariate model

We can extend this process of deriving new terms by adding further powers of \(X\), allowing us to fit arbitrarily complex curves: \[ E[Y] = \beta_0 + \beta_1X + \dots + \beta_nX^n. \]

Note that each term gets its own parameter (\(\beta_i\)). If we have \(n\) data points, an \((n-1)\)-th-order model will have a parameter for each point, meaning it will fit the data perfectly!
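This perfect fit is easy to demonstrate with a small simulation (all values here are made up): six pure-noise data points, and a 5th-order polynomial whose six parameters match them exactly.

```r
set.seed(42)
n <- 6
x <- 1:n
y <- rnorm(n)                      # pure noise: no real structure at all
fit <- lm(y ~ poly(x, n - 1))      # n parameters for n data points
max(abs(resid(fit)))               # residuals are numerically zero
1 - sum(resid(fit)^2) / sum((y - mean(y))^2)   # R^2 = 1: a "perfect" fit to noise
```

The fit is perfect despite there being nothing real to fit - a first hint at the overfitting problem discussed later in the module.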

Example 4: Fitting higher-order polynomial terms

Higher-order terms increase the model’s flexibility, but also increase the risk of fitting random noise rather than meaningful structure. We will return to this important issue later in the module.

Note: The meaning of “linear” in linear models
Even though higher-order models can model nonlinear relationships between the outcome and predictors, they are still linear models because each parameter enters additively. “Linearity” here refers to the parameters, not the shape of the fitted curve.

4.1.2.3 Fitting higher-order univariate models in R

In R, higher-order polynomial terms can be included using the I() function to indicate ‘as is’ operations. For example, to fit a quadratic model:

lm(y ~ x + I(x^2), data = nonlinearData)

Call:
lm(formula = y ~ x + I(x^2), data = nonlinearData)

Coefficients:
(Intercept)            x       I(x^2)  
     2.6185      -3.3760       0.8218  

or using the poly() function, which by default constructs orthogonal polynomial terms - hence the different coefficients below. (Use poly(x, 2, raw = TRUE) to reproduce the I() coefficients.)

lm(y ~ poly(x, 2), data = nonlinearData)

Call:
lm(formula = y ~ poly(x, 2), data = nonlinearData)

Coefficients:
(Intercept)  poly(x, 2)1  poly(x, 2)2  
     0.3493       0.3941       3.0208  

4.1.3 Interaction Models with Continuous Predictors

In Chapter 2 we saw a multiple regression model with two continuous predictors:

\[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2. \]

This again is a first-order model: each predictor appears only once, ‘as is’. Of course, we can add higher-order terms for each predictor separately (e.g., \(X_1^2\), \(X_2^3\)) to capture curvature in their individual effects, as we did above. For example, a second-order multivariate model with two predictors might look like: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1^2 + \beta_4X_2^2. \]

Now, however, we can also consider a different form of higher-order term: what happens when we multiply predictors together? This gives us an interaction term - a term that combines two (or more) predictors.

Key-term: Interaction term
An interaction is a product of predictors that allows the effect of one predictor to depend on the level of another. For two continuous predictors, the second-order interaction model is: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_1X_2. \] If \(\beta_3 = 0\) (i.e. as in the first-order model), predictors act independently; if \(\beta_3 \neq 0\), the slope for \(X_1\) varies with \(X_2\) and vice versa.
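A small simulation (entirely hypothetical data) makes the key consequence concrete: with an interaction, the slope for \(X_1\) is \(\beta_1 + \beta_3 X_2\), so it changes with the value of \(X_2\).

```r
set.seed(1)
n  <- 500
x1 <- runif(n)
x2 <- runif(n)
# Generate data with a known interaction: beta_1 = 2, beta_3 = 1.5
y  <- 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rnorm(n, sd = 0.2)
b  <- coef(lm(y ~ x1 * x2))
# Implied slope for x1 at low, middling, and high values of x2
# (beta_1 + beta_3 * x2): approximately 2.0, 2.75, and 3.5
b["x1"] + b["x1:x2"] * c(0, 0.5, 1)
```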

Example 5: Fitting an interaction model to the advertising dataset.
If we recall our plot of the advertising dataset from Section 3.2, our first-order linear model did not fit the data well in certain regions:

The first-order model is \[ E[Sales] = \beta_0 + \beta_1TV + \beta_2Radio. \tag{4.1}\]

ads <- read.csv("Data/Advertising.csv")
ads_mod <- lm(Sales ~ TV + Radio, data = ads)

The plane does not fit the data at the edges - it overestimates sales when either Radio or TV advertising is low (separately), but underestimates sales when both are high. This suggests that the effect of increasing one type of advertising depends on the level of the other type - an interaction effect.

We can fit a second-order interaction model to capture this: \[ E[Sales] = \beta_0 + \beta_1TV + \beta_2Radio + \beta_3(TV \times Radio). \tag{4.2}\]

The fitted interaction model has a characteristic ‘saddle’ shape, which can better capture the data structure:

The interaction effect can also be visualised by fixing one predictor and plotting the relationship between the other predictor and the outcome. In this case, we can plot lines representing the relationship between TV advertising and Sales when Radio advertising is low ($0 spent), average (i.e. mean(ads$Radio) = $23.264 thousand), and high (i.e. max(ads$Radio) = $49.6 thousand):

We can see that the relationship between TV advertising and Sales (represented by the slope of the regression lines) is different at different levels of Radio advertising. In particular, as radio advertising increases, the slope of the line increases, indicating that TV advertising has a larger effect on Sales when Radio advertising is also high.

When our first order model has more than two predictors, we can include interaction terms between any pair (or more) of predictors. For example, with three predictors \(X_1\), \(X_2\), and \(X_3\), a second-order interaction model would include interactions between each pair: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_1X_2 + \beta_5X_1X_3 + \beta_6X_2X_3. \]

4.1.3.1 Higher-order interaction models

We can also consider higher-order interactions - those that include combinations of three or more predictors. For example, a third-order interaction model with three predictors would include the ‘three-way’ interaction term: \[ E[Y] = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \beta_4X_1X_2 + \beta_5X_1X_3 + \beta_6X_2X_3 + \beta_7X_1X_2X_3. \] The \(\beta_7\) parameter here captures how the interaction between two predictors changes depending on the level of the third predictor.

As in the case of higher-order univariate models, we can extend this idea to include both higher-order interaction terms and higher-order univariate terms for each predictor, allowing for very flexible modeling of complex relationships.

However, as before, increasing model complexity in this way raises the risk of overfitting and interpretability challenges, which we will discuss later in this module.


4.1.3.2 Fitting interaction models in R

In R, interaction terms can be included using the * operator in the formula. For example, to fit a full interaction model with three predictors (all main effects, all pairwise interactions, and the three-way interaction):

lm(Y ~ X1*X2*X3, data = mydata)

This expands to include both the main effects and all interaction terms. To include only a specific interaction without the full expansion, use the : operator. For example, if we only wanted the interaction between X1 and X2, along with the linear terms for X1, X2, and X3:

lm(Y ~ X1+X2+X3+X1:X2, data = mydata)
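One way to see exactly what these formula operators expand to is to inspect the term labels R constructs; a quick sketch:

```r
# (X1 + X2 + X3)^2 expands to all main effects plus every pairwise
# interaction - exactly the second-order interaction model above:
attr(terms(~ (X1 + X2 + X3)^2), "term.labels")

# X1 * X2 * X3 additionally includes the three-way term X1:X2:X3:
attr(terms(~ X1 * X2 * X3), "term.labels")
```

The `^2` shorthand is convenient when you want all pairwise interactions among several predictors without the higher-order terms.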

4.1.4 Interaction Models with Categorical Predictors

As in Section 3.5, a categorical predictor with \(k\) levels is represented by \(k-1\) dummy variables: \[ E[Y] = \beta_0 + \beta_1X + \gamma_1D_1 + \dots + \gamma_{k-1}D_{k-1}. \]

  • Each \(\gamma_j\) measures the difference between level \(j\) and the baseline.
  • Changing the baseline changes interpretations but not fitted values.
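R constructs these dummy variables automatically; model.matrix() lets us inspect them. A quick sketch using the built-in iris data, where Species has \(k = 3\) levels:

```r
# Species has 3 levels, so R builds k - 1 = 2 dummy variables (plus the
# intercept). The baseline level (setosa) is the row where both dummies are 0.
unique(model.matrix(~ Species, data = iris))
```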

4.1.4.1 Continuous \(\times\) categorical interactions

When a continuous predictor interacts with a factor, each group can have its own slope: \[ E[Y] = \beta_0 + \beta_1X + \gamma D + \delta (XD), \] where \(D\) indicates the non-reference group. The baseline group has slope \(\beta_1\), and the other group has slope \(\beta_1 + \delta\).

For example, in a model of fuel economy with vehicle weight as the continuous predictor and transmission type as the factor, the interaction term tells us whether the relationship between weight and fuel economy differs between automatic and manual cars (an ANCOVA-style comparison).
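A sketch of this using the built-in mtcars data (where am encodes transmission: 0 = automatic, 1 = manual); the group slopes follow the \(\beta_1\) and \(\beta_1 + \delta\) arithmetic above:

```r
# Fit separate weight slopes for automatic and manual cars via an interaction:
fit <- lm(mpg ~ wt * factor(am), data = mtcars)
b <- coef(fit)

# Baseline (automatic) slope is the wt coefficient; the manual slope adds
# the interaction coefficient, i.e. beta_1 + delta:
c(automatic = unname(b["wt"]),
  manual    = unname(b["wt"] + b["wt:factor(am)1"]))
```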

4.1.4.2 Categorical \(\times\) categorical interactions

With two factors, an interaction means the difference between levels of one factor depends on the level of the other. This is the two-way ANOVA setting with an interaction term.

For example, with transmission type and cylinder count as the two factors, a significant interaction term means the transmission effect is not consistent across cylinder groups.

Note
We saw in Section 3.5 that first order models with categorical predictors are equivalent to the familiar t-tests and ANOVA models from earlier statistics courses. Interaction models with both continuous and categorical predictors are similarly analogous to “ANCOVA” models.

Exercise 1: Interpreting a continuous-by-categorical interaction

Fit mpg ~ wt * am_f on mtcars_cat.

  1. Which group has the steeper slope for weight?
  2. What does the interaction term represent in words?

4.2 Assessing model fit and model complexity

4.2.1 Why not increase model complexity?

In the previous section we saw how adding higher-order terms and interactions can help us capture complex relationships in data. However, increasing model complexity is not always beneficial. While more complex models can fit the training data (i.e. the data we use to fit our model) better, they may not generalise well to new data. This is because complex models can start to fit random noise in the training data, rather than the underlying signal. This phenomenon is known as overfitting.
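A minimal simulation sketches this (all data here are simulated): a 9th-order polynomial fits its training sample better than a model matching the true quadratic form, but typically predicts a fresh sample worse.

```r
set.seed(7)
make_data <- function(n) {
  x <- runif(n, -2, 2)
  data.frame(x = x, y = x^2 + rnorm(n, sd = 0.5))  # true relationship: quadratic
}
train <- make_data(30)
test  <- make_data(30)   # fresh data from the same process

m2 <- lm(y ~ poly(x, 2), data = train)   # matches the true form
m9 <- lm(y ~ poly(x, 9), data = train)   # overly flexible

rmse <- function(m, d) sqrt(mean((d$y - predict(m, newdata = d))^2))
c(train_2 = rmse(m2, train), train_9 = rmse(m9, train))  # m9 always wins here
c(test_2  = rmse(m2, test),  test_9  = rmse(m9, test))   # but tends to lose here
```

The training error of the flexible model is guaranteed to be at least as low, which is precisely why training fit alone cannot be used for model selection.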

In addition, more complex models can be harder to interpret, making it difficult to understand the relationships between predictors and the outcome. They may also be more sensitive to outliers and multicollinearity, leading to unstable parameter estimates.

A good model is no more complex than necessary to describe the main structure of the data. Therefore, when building regression models, we need to balance the trade-off between model fit and model complexity. In this section, we discuss some common issues that arise with complex models, and metrics for assessing model fit while accounting for complexity.

Key-point: Balancing fit and complexity
A good regression model balances:

  • Adequate fit to the data,
  • Simplicity and interpretability,
  • Generalisability to new data.

4.2.2 Fit Metrics

Fit metrics quantify how well a model captures patterns in the data. We have already seen one such metric: the coefficient of determination, \(R^2\).

Key-term: Coefficient of Determination (\(R^2\))
\[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \] where \(SS_{res}\) is the residual sum of squares and \(SS_{tot}\) is the total sum of squares. \(R^2\) measures the proportion of variance in the outcome explained by the model, ranging from 0 (no explanatory power) to 1 (perfect fit).

\(R^2\) was introduced in Chapter 2 as a measure of fit for simple linear regression models, based solely on the proportion of variance in \(Y\) explained by \(X\). We can also calculate \(R^2\) for multiple regression models, where it measures the proportion of variance in \(Y\) explained by all predictors jointly. However, \(R^2\) has a key limitation: it always increases (or at least does not decrease) when additional predictors are added to the model, even if those predictors do not meaningfully improve the model’s explanatory power. This can lead to overfitting, where a model appears to fit the training data well but performs poorly on new data - for example, the \((n-1)\)-th-order polynomial model discussed in Section 4.1.2.2 has an \(SS_{res}\) of zero and thus an \(R^2\) of 1, but is unlikely to generalise well.

4.2.2.1 Adjusted \(R^2\)

To address this limitation, we can adjust \(R^2\) to penalise the addition of unnecessary predictors. The adjusted \(R^2\) introduces a penalty term based on the number of predictors and the sample size, providing a more balanced measure of model fit that accounts for complexity.

Key-term: Adjusted \(R^2\)
\[ \text{Adjusted } R^2 = 1 - \left(1 - R^2\right) \frac{n - 1}{n - k - 1} \] where \(n\) is the sample size and \(k\) is the number of predictors. Adjusted \(R^2\) can decrease when unnecessary predictors are added, making it a more reliable metric than \(R^2\) for model selection. It can be computed in R as summary(lm_model)$adj.r.squared.
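As a sanity check, the formula can be computed by hand and compared with R's built-in value (a sketch using the built-in mtcars data):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
r2  <- summary(fit)$r.squared
n   <- nrow(mtcars)
k   <- 2                                        # number of predictors
adj <- 1 - (1 - r2) * (n - 1) / (n - k - 1)     # the formula above
c(by_hand = adj, from_summary = summary(fit)$adj.r.squared)  # identical
```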

Adjusted \(R^2\) is particularly useful because of its standardised scale (0 to 1), and the interpretation of \(R^2\) as the proportion of variance explained still holds - now with the added benefit of penalising unnecessary complexity. However, it is important to note that adjusted \(R^2\) is just one of many metrics available for assessing model fit while accounting for complexity.

4.2.2.2 Information Criteria: AIC, BIC

Another common approach to balancing model fit and complexity is to use information criteria, such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These criteria provide a quantitative way to compare models, penalising those with more parameters.

Key-term: Akaike Information Criterion (AIC)
\[ \text{AIC} = 2k - 2\ln(L) \] where \(k\) is the number of parameters in the model and \(L\) is the likelihood of the model. Lower AIC values indicate a better balance between fit and complexity. It can be computed in R using the AIC() function.

Key-term: Bayesian Information Criterion (BIC)
\[ \text{BIC} = \ln(n)k - 2\ln(L) \] where \(n\) is the sample size, \(k\) is the number of parameters, and \(L\) is the likelihood of the model. Like AIC, lower BIC values indicate a better balance between fit and complexity. It can be computed in R using the BIC() function.
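Both definitions can be verified by hand against R's AIC() and BIC() functions (a sketch using the built-in mtcars data). One subtlety: for lm objects, R counts the residual variance as an extra parameter, so \(k\) here is the number of coefficients plus one:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
ll  <- as.numeric(logLik(fit))
k   <- attr(logLik(fit), "df")   # coefficients + residual variance
n   <- nobs(fit)
c(by_hand = 2 * k - 2 * ll,      from_AIC = AIC(fit))
c(by_hand = log(n) * k - 2 * ll, from_BIC = BIC(fit))
```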

Both AIC and BIC penalise model complexity, but BIC generally imposes a larger penalty for additional parameters, especially in larger samples. The choice between AIC and BIC often depends on the context and goals of the analysis. AIC is generally preferred for predictive accuracy, while BIC is more conservative and favours simpler models.

Key-point: Interpreting AIC and BIC
Unlike adjusted \(R^2\), AIC and BIC do not have a fixed scale, so their absolute values are not interpretable on their own. Instead, they are used to compare models fitted to the same dataset - the model with the lowest AIC or BIC is considered the best among the candidates.

4.2.2.3 Calculating fit metrics with broom::glance()

In R, we can calculate adjusted \(R^2\), AIC, and BIC using the broom package’s glance() function, which provides a tidy summary of model fit statistics. For example, using our fitted 2nd-order interaction model of the advertising dataset:

library(broom)
glance(ads_mod_int)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
      <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.968         0.967 0.944     1963. 6.68e-146     3  -270.  550.  567.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Note that glance() returns a data frame with the mentioned fit metrics, along with other useful statistics such as the number of observations (nobs), residual standard error (sigma) and the global F-statistic (statistic). Alongside tidy(), glance() is a powerful tool for summarising and comparing regression models in R.

4.2.2.4 Comparing model fit

When comparing multiple models fitted to the same dataset, we can use adjusted \(R^2\), AIC, and BIC to assess which model provides the best balance between fit and complexity. The key points to remember are:

  • Higher adjusted \(R^2\) values indicate better fit.
  • Lower AIC and BIC values indicate better fit.

Example 6: Comparing first-order and interaction models
Let’s continue our investigation of the advertising dataset from Section 3.2 by comparing our first-order model of Sales (ads_mod) to our second-order interaction model (ads_mod_int).

Let’s combine the fit metrics from both models into a single table for easy comparison:

ads_fit_glance <- rbind(
  glance(ads_mod) |> `rownames<-`("First-order model"),
  glance(ads_mod_int) |> `rownames<-`("Interaction model"))

ads_fit_glance
# A tibble: 2 × 12
  r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
*     <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
1     0.897         0.896 1.68       860. 4.83e- 98     2  -386.  780.  794.
2     0.968         0.967 0.944     1963. 6.68e-146     3  -270.  550.  567.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Since the interaction model Equation 4.2 differs from the first-order model Equation 4.1 by the addition of only the \(TV \times Radio\) interaction term, we can interpret the change in fit metrics as the change in model fit due to adding this term. From the table above, we can see that the interaction model has a higher adjusted \(R^2\) and lower AIC and BIC values compared to the first-order model. This indicates that adding the interaction term improves the model’s fit to the data, even after accounting for the increased complexity.

Moreover, we can test the hypothesis that the interaction term significantly improves model fit using an ANOVA comparison:

anova(ads_mod, ads_mod_int)
Analysis of Variance Table

Model 1: Sales ~ TV + Radio
Model 2: Sales ~ TV + Radio + TV:Radio
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    197 556.91                                  
2    196 174.48  1    382.43 429.59 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table shows a significant p-value for the comparison, indicating that the interaction model provides a significantly better fit to the data than the first-order model. Therefore, we conclude that including the interaction term is justified in this case.


4.3 A model building workflow

So far in this chapter we have discussed how to build more complex regression models using higher-order terms and interactions, as well as how to assess model fit while accounting for complexity. These are important tools for building effective regression models, and we now turn to strategies for applying these tools in practice.

4.3.1 Exploratory data analysis and theory-driven model development

Before fitting any models, it is crucial to conduct exploratory data analysis (EDA) to understand the relationships between variables and identify potential patterns or anomalies. Visualisation and summary statistics can help identify non-linear relationships, interactions, and outliers that may inform model development. They can also highlight potential issues such as multicollinearity or heteroscedasticity that may need to be addressed (we will cover these potential pitfalls in the next chapter).

4.3.1.1 Pairs plots and correlation matrices

Pairs plots and correlation matrices are useful tools for visualising relationships between multiple variables simultaneously. They can help identify potential predictors for inclusion in the model, as well as potential interactions or non-linear relationships. For example, a pairs plot can show scatterplots of each pair of variables, along with histograms of each variable on the diagonal. A correlation matrix can show the strength and direction of linear relationships between variables. We will use the ggpairs() function from the GGally package (an extension to ggplot2) to create pairs plots in R:

library(GGally)
ggpairs(ads)

4.3.2 Automated model selection methods

4.3.3 Stepwise regression

Stepwise methods provide automated ways to search for simpler models.

4.3.3.1 Forward Selection

Begin with a minimal model (often the intercept). Add predictors one at a time when they improve AIC or adjusted \(R^2\).

4.3.3.2 Backward Selection

Begin with a saturated model containing all candidate predictors. Remove predictors that do not meaningfully contribute.

Note
Stepwise procedures should be treated as screening tools, not definitive modelling strategies. Final model decisions should consider diagnostics, interpretability, and subject-matter knowledge.

Example 7: Samara: comparing seeds across trees
Maple samaras (winged seeds) fall like tiny helicopters. A forest scientist measured their fall speed (Velocity) against their ‘disk loading’ (a quantity based on their size and weight; encoded as Load) for three trees (Tree). The goal: check whether the rate of change in velocity with load differs between trees.

samara <- read_csv("Data/samara.csv", show_col_types = FALSE) |>
  mutate(Tree = factor(Tree))

The slopes look similar but not identical. Perhaps tree identity matters, and perhaps the slopes differ slightly by tree. We can compare three candidate models:

Candidate models

sam_mods <- list(
  same_slope   = lm(Velocity ~ Load, data = samara),
  add_tree     = lm(Velocity ~ Load + Tree, data = samara),
  interaction  = lm(Velocity ~ Load * Tree, data = samara)
)

ANOVA and fit metrics

anova(sam_mods$same_slope, sam_mods$add_tree, sam_mods$interaction)
Analysis of Variance Table

Model 1: Velocity ~ Load
Model 2: Velocity ~ Load + Tree
Model 3: Velocity ~ Load * Tree
  Res.Df     RSS Df Sum of Sq     F  Pr(>F)  
1     33 0.21476                             
2     31 0.20344  2  0.011322 0.992 0.38306  
3     29 0.16549  2  0.037949 3.325 0.05011 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sam_fit <- names(sam_mods) |>
 map_dfr(\(nm) glance(sam_mods[[nm]]) |>
           mutate(model = nm, formula = format(formula(sam_mods[[nm]])))) |>
 select(model, formula, adj_r2 = adj.r.squared, AIC, BIC)

sam_fit
# A tibble: 3 × 5
  model       formula                adj_r2   AIC   BIC
  <chr>       <chr>                   <dbl> <dbl> <dbl>
1 same_slope  Velocity ~ Load         0.791 -72.9 -68.3
2 add_tree    Velocity ~ Load + Tree  0.789 -70.8 -63.1
3 interaction Velocity ~ Load * Tree  0.817 -74.1 -63.2

Tree alone does not add much (p ≈ 0.38). The interaction is borderline (p ≈ 0.05) and has the lowest AIC/BIC, hinting that slopes might vary slightly by tree.

Forward vs backward stepwise selection

scope <- list(lower = ~1, upper = ~Load * Tree)
step_forward  <- step(lm(Velocity ~ 1, samara), scope = scope,
                      direction = "forward", trace = 0)
step_backward <- step(lm(Velocity ~ Load * Tree, samara), scope = scope,
                      direction = "backward", trace = 0)

c(forward  = format(formula(step_forward)),
  backward = format(formula(step_backward)))
                 forward                 backward 
       "Velocity ~ Load" "Velocity ~ Load * Tree" 
AIC(step_forward, step_backward)
              df       AIC
step_forward   3 -72.94911
step_backward  7 -74.07063

Forward selection stops at the simple Load-only model; backward selection keeps the interaction. Their AIC values are close, but the interaction edges ahead, so retain it if tree-specific slopes matter for interpretation.